We needed to convert our data to a different type, add some columns, manipulate the data, and then visualize and test the results.
- Python, NumPy, SciPy, and pandas:
  - filtering, grouping, sorting
  - aggregations, descriptive stats
  - statistical tests
- feature design:
  - year (2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)
  - month (numerical: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
  - month (name: Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec)
  - weekday (numerical: 0, 1, 2, 3, 4, 5, 6)
  - weekday (name: Mon, Tue, Wed, Thu, Fri, Sat, Sun)
  - quarter (financial: 1, 2, 3, 4)
- visualize the discrete features (new columns) as y-axis variables:
  - Matplotlib and Seaborn libraries
  - histograms, bar charts, line charts
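The feature-design step above can be sketched with pandas' `.dt` accessor. This is a minimal, self-contained illustration on a stand-in date range; the column name `date` is an assumption, not necessarily what the original dataset uses:

```python
import pandas as pd

# Stand-in data: monthly timestamps spanning the study window.
df = pd.DataFrame({'date': pd.date_range('2012-01-01', '2019-12-01', freq='MS')})
df['date'] = pd.to_datetime(df['date'])  # no-op here; needed when loading from CSV

df['year'] = df['date'].dt.year                    # 2012 ... 2019
df['month'] = df['date'].dt.month                  # 1 ... 12
df['month_name'] = df['date'].dt.strftime('%b')    # Jan ... Dec
df['weekday'] = df['date'].dt.weekday              # 0 (Mon) ... 6 (Sun)
df['weekday_name'] = df['date'].dt.strftime('%a')  # Mon ... Sun
df['quarter'] = df['date'].dt.quarter              # 1 ... 4

print(df.head(3))
```

Each derived column is then available for the grouping and plotting below.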
stats1 = df.groupby(['weekday_name'], as_index=False).agg({'data_science': 'mean'})
stats2 = df.groupby(['weekday_name'], as_index=False).agg({'data_scientist': 'mean'})
# print(round(stats1.sort_values('data_science'),2))
# DataFrame.hist creates its own figure, so pass figsize directly
# (a preceding plt.figure() would just produce an empty figure)
stats1.hist(bins=50, figsize=(15, 16))
plt.title('Distribution of Weekday Averages - Data Science\n')
plt.show()
# print(round(stats2.sort_values('data_scientist'),2));
stats2.hist(bins=50, figsize=(20, 20))
plt.title('Distribution of Weekday Averages - Data Scientist')
plt.show()
stats = df.groupby(['weekday_name'], as_index=False).agg({'data_scientist': "mean", 'data_science': "mean"})
# sort the whole frame once so bar lengths and weekday labels stay paired
stats = stats.sort_values('data_science')
x = stats['data_science']
x1 = stats['data_scientist']
mean1 = x.mean()
y = stats['weekday_name']
plt.figure(figsize=(17,6))
sns.barplot(x=x, y=y, data=stats)
plt.title('\nAverage Interest Over Time, By Weekday: 2012-2019\nKeyword Term: Data Science (US)\n\n (n=96\nmean=42.8129\nmedian=43.53\nstdv=1.85\nmin=40.00\nmax=44.846)\n')
plt.ylabel("")
plt.xlabel("\nInterest Level (0-100)")
plt.axvline(x=mean1, color='black', linestyle='--')
plt.axvline(x=x.median(), color='blue', linestyle='--')
plt.axvline(x=x.std(), color='yellow', linestyle='--')
plt.axvline(x=x.max(), color='red', linestyle='--')
plt.axvline(x=x.min(), color='cyan', linestyle='--')
plt.show()
# print(x.median())
# print(x.sort_values())
# stats['data_science'].mean() 42.8129
# stats['data_scientist'].mean() 46.5744
# print(x.min())
# print(x.max())
stats = df.groupby(['weekday_name'], as_index=False).agg({'data_scientist': "mean", 'data_science': "mean"})
# sort the whole frame once so bar lengths and weekday labels stay paired
stats = stats.sort_values('data_scientist')
x = stats['data_science']
x1 = stats['data_scientist']
mean1 = x.mean()
mean2 = x1.mean()
plt.figure(figsize=(17,6))
sns.barplot(x=x1, y='weekday_name', data=stats)
plt.title('\nAverage Interest Over Time, By Weekday: 2012-2019\nKeyword Term: Data Scientist (US)\n\n (n=96\nmean=46.5744\nmedian=46.0769\nstdv=1.85\nmin=43.92\nmax=49.23)\n')
plt.ylabel("")
plt.xlabel("\nInterest Level (0-100)")
plt.axvline(x=mean2, color='black', linestyle='--')
plt.axvline(x=x1.median(), color='blue', linestyle='--')
plt.axvline(x=x1.std(), color='yellow', linestyle='--')
plt.axvline(x=x1.max(), color='red', linestyle='--')
plt.axvline(x=x1.min(), color='cyan', linestyle='--')
# plt.legend(('mean', 'median', 'stdv', 'max', 'min'), loc='center')
plt.show()
# print(x1.median())
# print(x.sort_values())
# stats['data_science'].mean() 42.8129
# stats['data_scientist'].mean() 46.5744
from scipy import stats as scistats  # alias avoids shadowing the `stats` DataFrame above
plt.figure(figsize=(17,6))
plt.hist(df['data_science'], alpha=.5, bins=20, color='purple');
plt.hist(df['data_scientist'], alpha=.3, bins=20);
plt.xlabel('Interest Level')
plt.ylabel('# of Occurrences')
plt.legend(['Data Science', 'Data Scientist'])
plt.axvline(df['data_science'].mean(), color='red', linestyle='-')
plt.axvline(df['data_scientist'].mean(), color='teal', linestyle='-')
plt.title('Distribution of Interest Level\nn=96\n')
plt.show()
mean1 = df['data_science'].mean()
# print(mean1)
# plt.hist(df['computer'], alpha=.5, bins=50)
# plt.hist(df['love'], alpha=.5, bins=50)
# plt.show()
# print(df.shape)
And if we want to test our hypothesis, we'll need more data!
But first, let's look at a few more charts to highlight relationships, starting with averages by month.
# mean_by_quarter = df.groupby(['quarter']).mean().reset_index()
# mean_by_quarter = mean_by_quarter.sort_values('quarter')
# mean_by_quarter['data_science'] = abs(mean_by_quarter['data_science'])
# f, ax = plt.subplots(figsize=(15,6))
# sns.set_style("darkgrid")
# sns.barplot(x='data_science', y='quarter', data=mean_by_quarter)
# plt.title('\nAverage Interest Over Time, By Quarter: 2012-2019\nKeyword Term: Data Science (US)\n')
# plt.ylabel("")
# plt.xlabel("\nInterest Level (0-100)")
# plt.show()
mean_by_month = df.groupby(['month_name', 'month']).mean().reset_index()
mean_by_month = mean_by_month.sort_values('data_scientist')
mean_by_month['data_scientist'] = abs(mean_by_month['data_scientist'])
f, ax = plt.subplots(figsize=(15,6))
sns.set_style("darkgrid")
sns.barplot(x='data_scientist', y='month_name', data=mean_by_month)
plt.title('\nAverage Interest Over Time, By Month: 2012-2019\nKeyword Term: Data Scientist (US)\n')
plt.ylabel("")
plt.xlabel("\nInterest Level (0-100)")
mean = mean_by_month['data_scientist'].mean()
median = mean_by_month['data_scientist'].median()
plt.axvline(x=mean, color='black', linestyle='--')
plt.axvline(x=median, color='blue', linestyle='--')
plt.show()
mean_by_month = df.groupby(['month_name', 'month']).mean().reset_index()
mean_by_month = mean_by_month.sort_values('data_science')
mean_by_month['data_science'] = abs(mean_by_month['data_science'])
f, ax = plt.subplots(figsize=(15,6))
sns.set_style("darkgrid")
sns.barplot(x='data_science', y='month_name', data=mean_by_month)
plt.title('\nAverage Interest Over Time, By Month: 2012-2019\nKeyword Term: Data Science (US)\n')
plt.ylabel("")
plt.xlabel("\nInterest Level (0-100)")
mean = mean_by_month['data_science'].mean()
median = mean_by_month['data_science'].median()
plt.axvline(x=mean, color='black', linestyle='--')
plt.axvline(x=median, color='blue', linestyle='--')
plt.show()
df['ds_all'] = (df['data_science'] + df['data_scientist']) / 2
mean_by_month = df.groupby(['month_name', 'month']).mean().reset_index()
mean_by_month = mean_by_month.sort_values('ds_all')
mean_by_month['ds_all'] = abs(mean_by_month['ds_all'])
f, ax = plt.subplots(figsize=(15,6))
sns.set_style("darkgrid")
sns.barplot(x='ds_all', y='month_name', data=mean_by_month)
plt.title('\nAverage Interest Over Time, By Month: 2012-2019\nKeyword Term: Data Scientist & Data Science (US)\n')
plt.ylabel("")
plt.xlabel("\nInterest Level (0-100)")
mean = mean_by_month['ds_all'].mean()
median = mean_by_month['ds_all'].median()
plt.axvline(x=mean, color='black', linestyle='--')
plt.axvline(x=median, color='blue', linestyle='--')
plt.show()
mean_by_month.sort_values('ds_all')
| | month_name | month | data_scientist | data_science | computer | love | year | weekday | quarter | ds_all |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | July | 7 | 43.250 | 35.500 | 69.625 | 73.500 | 2015.5 | 3.000 | 3.0 | 39.3750 |
| 6 | June | 6 | 42.875 | 36.125 | 68.625 | 72.000 | 2015.5 | 3.625 | 2.0 | 39.5000 |
| 7 | March | 3 | 44.125 | 39.750 | 73.375 | 68.875 | 2015.5 | 3.500 | 1.0 | 41.9375 |
| 3 | February | 2 | 43.750 | 40.375 | 75.750 | 67.000 | 2015.5 | 3.250 | 1.0 | 42.0625 |
| 2 | December | 12 | 43.250 | 41.750 | 73.875 | 69.625 | 2015.5 | 3.750 | 4.0 | 42.5000 |
| 8 | May | 5 | 46.000 | 39.375 | 69.750 | 71.250 | 2015.5 | 2.375 | 2.0 | 42.6875 |
| 4 | January | 1 | 45.000 | 42.125 | 78.625 | 67.000 | 2015.5 | 2.875 | 1.0 | 43.5625 |
| 0 | April | 4 | 45.250 | 42.000 | 71.375 | 69.875 | 2015.5 | 3.000 | 2.0 | 43.6250 |
| 9 | November | 11 | 45.750 | 45.375 | 73.250 | 70.875 | 2015.5 | 3.500 | 4.0 | 45.5625 |
| 1 | August | 8 | 51.375 | 46.250 | 74.750 | 71.750 | 2015.5 | 2.500 | 3.0 | 48.8125 |
| 10 | October | 10 | 52.625 | 50.000 | 72.000 | 70.500 | 2015.5 | 2.250 | 4.0 | 51.3125 |
| 11 | September | 9 | 55.500 | 55.125 | 76.500 | 70.625 | 2015.5 | 3.750 | 3.0 | 55.3125 |
mean_by_year = df.groupby(['year']).mean().reset_index()  # group by year only, not by the value column
# mean_by_year = mean_by_year.sort_values('year')
mean_by_year['data_scientist'] = abs(mean_by_year['data_scientist'])
f, ax = plt.subplots(figsize=(15,6))
sns.set_style("darkgrid")
sns.lineplot(x='year', y='data_scientist', data=mean_by_year)
sns.set_palette("husl",3)
plt.title('\nAverage Interest Over Time 2012-2019\nKeyword Term: Data Scientist (US)\n')
plt.ylabel("Interest Level")
plt.xlabel("\nTime in Years")
plt.show()
mean_by_year = df.groupby(['year']).mean().reset_index()  # group by year only, not by the value column
# mean_by_year = mean_by_year.sort_values('year')
mean_by_year['data_science'] = abs(mean_by_year['data_science'])
f, ax = plt.subplots(figsize=(15,6))
sns.set_style("darkgrid")
sns.lineplot(x='year', y='data_science', data=mean_by_year)
sns.set_palette("husl",3)
plt.title('\nAverage Interest Over Time 2012-2019\nKeyword Term: Data Science (US)\n')
plt.ylabel("Interest Level")
plt.xlabel("\nTime in Years")
plt.show()
mean_by_year = df.groupby(['year']).mean().reset_index()  # group by year only, not by the value column
# mean_by_year = mean_by_year.sort_values('year')
mean_by_year['ds_all'] = abs(mean_by_year['ds_all'])
f, ax = plt.subplots(figsize=(15,6))
sns.set_style("darkgrid")
sns.lineplot(x='year', y='ds_all', data=mean_by_year)
sns.set_palette("husl",3)
plt.title('\nAverage Interest Over Time, By Year: 2012-2019\nKeyword Term: Data Scientist & Data Science (US)\n')
plt.ylabel("Interest Level")
plt.xlabel("\nTime in Years")
plt.show()
sns.set(style="ticks")
# jointplot creates its own figure; size it with height rather than plt.figure
sns.jointplot(x='year', y='ds_all', data=df, kind="hex", color="cyan", height=10)
plt.title('\nAverage Interest Over Time, By Year: 2012-2019\nKeyword Term: Data Scientist & Data Science (US)\n')
plt.ylabel("Interest Level")
plt.xlabel("\nTime in Years")
plt.show()
Here are our stats for Group A:
count 106.000000
mean 2.792453
std 2.728045
min 0.000000
25% 1.000000
50% 2.000000
75% 3.000000
max 17.000000
Here are our stats for Group B:
count 86.000000
mean 50.441860
std 25.564039
min 10.000000
25% 27.500000
50% 47.500000
75% 75.500000
max 100.000000
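The two summary tables above are the shape of output that pandas' `describe()` produces. A minimal sketch on stand-in data (the frame names and column label follow the source, but these values are randomly generated for illustration, not the actual Trends series):

```python
import numpy as np
import pandas as pd

# Stand-in samples sized like the real groups: 106 pre-article rows, 86 post-article rows
rng = np.random.default_rng(0)
col = 'data scientist: (United States)'
group_a = pd.DataFrame({col: rng.integers(0, 18, size=106)})   # low, sparse interest
group_b = pd.DataFrame({col: rng.integers(10, 101, size=86)})  # higher, spread-out interest

print("Here are our stats for Group A:")
print(group_a[col].describe())
print("\nHere are our stats for Group B:")
print(group_b[col].describe())
```

`describe()` returns count, mean, std, min, the quartiles, and max in one call, which is exactly the table printed for each group.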
plt.figure(figsize=(17,10))
plt.title ('Statistics for Interest Levels in Keyword "Data Scientist"\n2004 - 2011 vs. 2012 - 2019\n')
group_a['data scientist: (United States)'].plot.box(meanline=True);
plt.show()
plt.figure(figsize=(17,10))
group_b['data scientist: (United States)'].plot.box(meanline=True);
plt.show()
1) In the first box plot, we see a lot of outliers!
2) In the second box plot, we see none.
plt.figure(figsize=(20,10))
plt.hist(group_a['data scientist: (United States)'], bins=25, alpha= .5, color='purple')
plt.hist(group_b['data scientist: (United States)'], bins=25, alpha= .5, color='cyan')
# ax.vlines(x=group_a['data scientist: (United States)'].mean(), color='black', linestyle='--')
plt.title('Distribution of Interest Level:\n2004-2019\nGroup A & Group B\n Group A mean=2.79\n Group B mean=50.44')
plt.legend(labels=['Group A: Before HBR article','Group B: After HBR article'],
loc='right',
handlelength=5,
borderpad=3, labelspacing=5)
plt.show()
data_nobs = len(group_a['data scientist: (United States)'])
data_mean = group_a['data scientist: (United States)'].mean()
data_min = group_a['data scientist: (United States)'].min()
data_max = group_a['data scientist: (United States)'].max()
data_var = group_a['data scientist: (United States)'].var()
data_skew = group_a['data scientist: (United States)'].skew()
data_kurtosis = group_a['data scientist: (United States)'].kurtosis()
print("Group A: Pre-Harvard Business Review Article - Jan 1 2004 - Oct 1 2012\n\n")
print("N (Sample Size): {}".format(round(data_nobs,2)))
print("Mean: {}".format(round(data_mean,2)))
print("Min: {}".format(round(data_min,2)))
print("Max: {}".format(round(data_max,2)))
print("Variance: {}".format(round(data_var,2)))
print("Skewness: {}".format(round(data_skew,2)))
print("Kurtosis: {}".format(round(data_kurtosis,2)))
Group A: Pre-Harvard Business Review Article - Jan 1 2004 - Oct 1 2012

N (Sample Size): 106
Mean: 2.79
Min: 0
Max: 17
Variance: 7.44
Skewness: 2.32
Kurtosis: 7.67
A Shapiro-Wilk test confirms that, unfortunately, the raw Google Trends data is not normally distributed (p < .01): if we relied on the raw data to answer our original questions, we could not claim that the observed increase in trend reflects a statistically significant truth about the general population.
p-value = 2.212...e-11
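The normality check can be run with `scipy.stats.shapiro`. A minimal sketch; the right-skewed stand-in sample here is illustrative (sized like Group A), not the actual Trends series:

```python
import numpy as np
from scipy import stats as scistats

# Stand-in sample: exponential draws are strongly right-skewed, like Group A
rng = np.random.default_rng(0)
sample = rng.exponential(scale=3.0, size=106)

# Shapiro-Wilk: null hypothesis is that the sample came from a normal distribution
w_stat, p_value = scistats.shapiro(sample)
print(f"W = {w_stat:.4f}, p-value = {p_value:.3e}")
if p_value < 0.05:
    print("Reject normality: the sample is unlikely to be normally distributed.")
```

A tiny p-value (like the 2.2e-11 reported above) means we reject normality, which is why parametric tests on the raw series are off the table.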
- Based on this research, it's clear that relying on raw Google Trends data is not a scientifically sound basis for decision-making.
Opportunities for further research:
- Biases
- Shifts in perspectives
Sources:

Raw Data - Interest Level over Time:
- "Google Trends Data - Search Term: Data Scientist, January 1, 2004 - December 1, 2019; Interest over Time" (https://trends.google.com/trends/explore?date=2004-01-01%202019-12-01&geo=US&q=data%20scientist)
- "Google Trends Data - Search Term: Data Science, January 1, 2004 - December 1, 2019; Interest over Time" (https://trends.google.com/trends/explore?date=2004-01-01%202019-12-01&geo=US&q=data%20science)
- "Google Trends Data - Search Term: Data Science vs. Data Scientist, January 1, 2004 - December 1, 2019; Interest over Time" (https://trends.google.com/trends/explore?date=2004-01-01%202019-12-01&geo=US&q=data%20science,data%20scientist)
- 1. "Biocognitive" (https://www.biocognitive.com/index.php/philosophy/page.html)
- 2. "Quantization" (https://www.sciencedirect.com/topics/engineering/quantisation)

GIS data:
- "Google Trends Data - Search Term: Data Scientist, January 1, 2004 - December 1, 2019; Interest by SubRegion" (https://trends.google.com/trends/explore?date=2004-01-01%202019-12-01&geo=US&q=data%20scientist)
- "Google Trends Data - Search Term: Data Science, January 1, 2004 - December 1, 2019; Interest by SubRegion" (https://trends.google.com/trends/explore?date=2004-01-01%202019-12-01&geo=US&q=data%20science)